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Abstract 

In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of 
the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. 
For example, we might have one set of molecular observations and one set of physiological observations on the same 
group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being 
a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, 
we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations 
from different views are driven by different sets of latent components, and each component can be a complex, high¬ 
dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the 
components without having to model the other components. We show how to integrate RCA with stochastic gradient 
descent into a meta-algorithm for learning general models, and demonstrate substantial improvement in accuracy on 
several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn 
latent variable models when we don’t have samples from the true model but only samples after complex perturbations. 


1 Introduction 

A hallmark of modern data deluge is the prevalence of complex data that capture different aspects of some common 
phenomena. For example, for a set of patients, it’s common to have multiple modalities of molecular measurements 
for each individual (gene expression, genotyping, etc.) as well as physiological attributes. Each set of measurements 
corresponds to a view on the samples. The complexity and the heterogeneity of the data is such that it’s often not 
feasible to build a joint model for all the data. Moreover, if we are particularly interested in one aspect of the problem 
(e.g. patterns that are specific to a subset of genes that are not shared across all genes), it would be wasteful of 
computational and modeling resources to model the interactions across all the data. 

More concretely, suppose we have two sets (views) of data, U and V, on a common collection of samples. We 
model this as C/ = 81 + 82 and V = A82 + 83, where captures the latent component specific to U, 83 is specific to 
V, and 82 is common to both U and V and is related in the two views by an unknown linear transformation A. Each 
component 8i can be a complex, high-dimensional distribution. The observed samples from U and V are component¬ 
wise linear combinations of the unobserved samples from 8i. To model all the data, we would need to jointly model 
all three 8i, which can have prohibitive sample/computation complexity and also prone to model misspecification. 
Ideally, if we are only interested in the component that’s unique to the first view, we would simply write down a model 
for 81 without making any parametric assumptions about 82 and 83, except that they are independent. 

In this paper, we develop a general framework of Rich Component Analysis (RCA) to explore such multi-component, 
multi-view datasets. Our framework allows for learning an arbitrarily complex model of a specific component of the 
data, 8i, without having to make parametric assumptions about other components 8j. This allows the analyst to focus 
on the most salient aspect of data analysis. The main conceptual contribution is the development of new algorithms to 
learn parameters of complex distributions without any samples from that distribution. In the two-view example, we do 
not observe samples from our model of interest, 81. Instead the observations from U are compositions of true samples 
from 81 with complex signal from another process 82 which is shared with V. Our approach performs consistent 
parameter estimation of without modeling 82, 83. 
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Outline. RCA consists of two stages: 1 ) from the observed data, extract all the cumulants of the component that 
we want to model; 2 ) using the cumulants, perform method-of-moments or maximum likelihood estimation (MLE) 
of model parameters via polynomial approximations to gradient descent. We introduce the relevant properties of 
cumulants and tensors in Section]^ In Section]^ we develop the formal models for Rich Component Analysis (RCA) 
and the cumulant extraction algorithms. We discuss how RCA differs from existing models. Section shows how 
to integrate the extracted cumulants with method-of-moments or stochastic gradient descent for MLE inference. We 
show the performance gains of RCA in Section]^ All the proofs are in the Appendix. 

2 Preliminaries 

In this section we introduce the basics of cumulants. Eor more information please refer to Appendix [A| Cumulants 
provide an alternative way to describe the correlations of random variables. Unlike moments, cumulants have the nice 
property that the cumulant of sum of independent random variables equals to the sum of cumulants. For a random 
variable AT G K the cumulant is dehned to be the coefficients of the cumulant generating function logE[e*^]. 

We can also dehne cross-cumulants which are cumulants for different variables (e.g. covariance). For n variables 
Xi,..., Xn, their cross-cumulant can be computed using the following formula: 

= ^(kl - 1)!(-1)W-1 n nil X^]. 

TT BG-tt iGB 

In this formula, tt is enumerated over all partitions of [f], | 7 r| is the number of parts and B runs through the list of parts. 
We also use Kt{X) = Kt{X ,..., X) when it’s the same random variable. 

We can similarly dehne cumulants for multivariate distributions. For random vector X G the f-th order 
cumulant (and f-th order moment) is an object in (a f-th order tensor). The (U, *t)-th coordinate of cumulant 
tensor is Kt{Xi^, Xi^, ...,Xi^). We often unfold tensors into matrices. Tensor T G unfolds into matrix M = 
unfoldlT) G . Cumulants have several nice properties that we summarize below. 

Fact Suppose Xi ,..., Xt are random variables in The f-th order cumulant Kt(Xi ,..., Xt) is a tensor in that 
have the following properties: 

1 . (Independence) If (Xi,..., Xt) and (Yi,..., Yt) are independent, then Kt(Xi-|-Yi,..., Xt+Yt) = Kt(Xi, ...,Xt) + 

Kt(Yi,...,Yt). 

2 . (Linearity) Kt(ciXi ,..., CtXt) = ciC2 • • • CtKt(Xi ,..., Xj), more generally we can apply arbitrary linear trans¬ 
formations to multi-variate cumulants (see Appendix [A). 

3 . (Computation) The cumulant Kt(Xi, X()can be computed in 0 {{tdy) time. 

The second order cross-cumulant, K2{X,Y) is equal to the covariance E[(X — E[X])(Y — E[Y])]. Higher cu¬ 
mulants measures higher-order correlations and also provide a measure of the deviation from Gaussianity-all 3 rd and 
higher order cumulants of Gaussian random variables are zero. 

3 Rich Component Analysis 

In this section, we show how to use cumulant to disentangle complex latent components. The key ideas and applica¬ 
tions of RCA are captured in the contrastive learning setting when there are two views. We introduce this model next 
and then show how to extend it to general settings. 

3.1 RCA for contrastive learning 

Recall the example in the introduction where we have two views of the data, formally. 


U = Si+S2,V = AS2+S3. 


( 1 ) 


2 


Here, Si, 82,83 G are independent random variables that can have complicated distributions; A G is an 

unknown linear transformation. The observations consist of pairs of samples {u, v). Each pair is generated by drawing 
independent samples Si ^ Si,i = 1 , 2,3 and adding these samples component-wise to obtain u = si + S2 and 
V = As2 + S3. Note that the same S2 shows up in both u and v, introducing correlation between the two views. We 
are interested in learning properties about Si, for example learning its maximum likelihood (MLE) parameters. For 
concreteness, we focus our discussion on learning although our techniques also apply to 82 and S3. We don’t have 
any samples from ^i. The observations of U involves a potentially complicated perturbation by 82. Our hope is to 
remove this perturbation by utilizing the second view V, and we would like to do this without assuming a particular 
model for 82 or S3. 

Note that the problem is inherently under-determined; it is impossible to find the means of Si, 82, S3 without 
any additional information. This is in some sense the only ambiguity, as we will see if we know the mean of one 
distribution it is possible to extract all order cumulants of Si, 82,83. For simplicity throughout this section we assume 
the means of 81,82, S3 are 0 (given the mean of any of 81,82,83, we can always use the means of U and V to 
compute the means of other distributions, and shift them to have mean 0 ). 

Determining linear transformation First we can hnd A by the following formula: 

A^ = unfold{Kji{V, U, U, U))^unfold{K4{V, U, U, V)). ( 2 ) 

Lemma 3.1. Suppose the unfolding of the 4 -th order cumulant unfold{K4{AS2, 82, 82,82)) has full rank, given the 
exact cumulants K4{V, U, U, U) and K4{V, U, U, V), the above algorithm finds the correct linear transformation A in 
time 0{d^). 

Intuitively, since only 82 appears in both U and V, the cross-cumulants K4{y, U, U, U) and K4{V, U, U, V) depend 
only on 82. Also, by linearity of cumulants we must have unfold{K4{V, U, U,V)) = unfold{K4{V, U, U, U))A^ (see 
Appendix |B.l[ ). In the lemma we could have used third order cumulants, however for many distributions (e.g. all sym¬ 
metric distributions) the third order cumulant is 0 . Most distributions satisfy the condition that unfold{K4{AS2,82, 82, 82)) 
is full rank, the only natural distribution that does not satisfy this constraint is the Gaussian distribution (where K4 is 
0 ). 


Extracting cumulants Even when the linear transformation A is known, in most cases it is still information the¬ 
oretically impossible to hnd the values of the samples si, S2, S3 as we only have two views. However, we can still 
hope to learn useful information about the distributions 81,82,83. In particular, we derived the following formulas to 
estimate the cumulants of the distributions: 


Kt( 5 i) = Kt{U)-Kt{U,U,...,U,A-^V), 
Kt{S2) = Kt{U,U,...,U,A-^V), 
ttt{S3) = Kt{V)-Kt{AU,V,V,...,V). 


( 3 ) 

( 4 ) 

( 5 ) 


Theorem 3.2. For all t 1 , E(^llCltlOfT,S compute the t-th order cumulants for Si, 82, S3 in time 0 {{tdY'^^) 

Proof of Theorem | 3 . 2 | relies on the fact that since only 82 appears in both U and V, the cross-cumulant Kt{U,U, ...,U, A~^V) 
captures the cumulant of 82. Moreover, by independence, Kt{U) = Kt(S'i) + Kt{S2), so we can recover Kt{Si) by 
subtracting off the estimated k{S2) (and similarly for ^((S'a)). When the dimension of U is smaller than the dimension 
of V and A G column rank, the above formula with pseudo-inverse A^ in place of A~^ still recovers 

all cumulants. In Appendix |B. 1 [ we prove that both the formulas for computing A and for extracting the cumulants 
are robust to noise. In particular, we give the sample complexity for learning A and Kt{Si) from samples of U and V, 
both are polynomial in relevant quantities. 

Given Kt{Si), we can use standard algorithms to compute moments of Si. Many learning algorithms are based on 


method-of-moments and can be directly applied (see Section 4 . 1 1. Other optimization-based algorithms can also be 
adapted (Section[ 42 ]). 
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3.2 General model of Rich Component Analysis 

We can extend the cumulant extraction algorithm in contrastive learning to general settings with more views and 
components. The ideas are very similar, but the algorithm is more technical in order to keep track of all the components. 
We present the intuition and the main results here and defer the details to Appendix |B. 2 | Consider a set of observations 
Ui,U2, ■ ■ ■ ,Uk G each is linearly related to a subset of variables Si, S2, ■ ■ ■, Sp G the variable Sj appears in 
a subset Qj C [A:] of the observations. That is, 

p 

VzG[fc] = ( 6 ) 

i=i 

where G are unknown linear transformations, and = 0 if i ^ Qj. For simplicity we assume all the 

linear transformations are invertible. The variable Sj models the latent source of signal that is common to the subset of 
observations {Ui\i G Qj}- The matrix models the transformation of latent signal Sj in view i. In order for the 
model to be identifiable, it is necessary that all the subsets Qj’s are distinct (otherwise the latent sources with identical 
Qj can be collapsed into one Sj). In the most general setting, we have a latent signal that is uniquely associated with 
every subset of observations. In this case, p = 2 ^ — 1 and {Qj} corresponds to all the non-empty subsets of [A:]. In 
some settings, only specific subset of views Ui share common signals and {Qj} can be a small set. We measure the 
complexity of the set system using the following notion: 

Definition 3.1 (L-distinguishable). We say a set system {Qj} is L-distinguishable, if for every set Qj, there exists 
a subset T C Qj of size at most L (called the distinguishing set) such that for any other set Qjfj' ^ j), either 
Qj C Qj' orT Qj,. 

For example, the set system of the contrastive model is {{!}, { 1 , 2 }, { 2 }} and it is 2 -distinguishable. Intuitively, 
for any set Qj in the set system, there is a subset T of size at most L that distinguishes Qj from all the other sets 
(except the supersets of Qj). We use Algorithmj^to recover all the linear transformations (for more details of 

the algorithm see Algorithmin Appendix). Algorithmtakes as input a set system {Qj} that captures our prior 
belief about how the datasets are related. When we don’t have any prior belief, we can input the most general {Qj} 
of size 2 ^ — 1 , which is A;-distinguishable. The algorithm automatically determines if certain variable Sj = 0 . In the 
algorithm, min(( 3 _,) is the smallest element of Qj. 


Algorithm 1 FindLinear 

Require: set system {Qj} that is i-distinguishable, L + 1 -th order moments 

repeat 

Pick a set Qj that is not a subset of any remaining sets 
Let T = {wi,W2, ...,wl} be the distinguishing set for Qj 
Compute cumulants for all i € Qj'. Mi = unfold{KL+i{Uwi, ■■■, U^^jUi). 
If Mmin Qj = 0 then the variable Sj = 0 ; continue the loop. 

Let A(®J) = for all i G Qj, A^®^) = 0 for all i Qj. 

Mark Qj as processed, subtract all the cumulants of Qj. 
until all sets are processed 


Lemma 3 . 3 . Given observations Ui’s as defined in Equation suppose the sets Qj’s are L-distinguishable, all the 
unknown linear transformations A*^®’-^) ’s are invertible, unfoldings unfold{KL^i{Sj)) is either 0 (if Sj = OJ or have 
full rank, then given the exact L -\- 1 -th order cumulants, Algorithm{I^outputs all the correct linear transformations 
A^'‘d') in time poly{L\, (dk)^). 

Once all the linear transformations A^®’-’^ are recovered, we follow the same strategy as in the contrastive analysis 
case l 3 .ll 

Theorem 3 . 4 . Under the same assumption as Lemma [ 3 A] for any t > L Algorithm^computes the correct t-th order 
cumulants for all the variables in time poly{{L -\- f)!, (dk)^'^^). 
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Note that in the most general case it is impossible to hnd cumulants with order t < L, because there can be many 
different variables S'j’s but not enough views. Both Algorithms andare robust to noise, with sample complexity 
that depends polynomially on the relevant condition numbers, and exponential in the order of cumulant considered. 
For more details see Appendix|B.2| 


3.3 Related models 

Independent component analysis (ICA)||4] may appear similar to our model, but it is actually quite different. In ICA, 
let s = [si,..., s„] be a vector of latent sources, where s^’s are one dimensional independent, non-Gaussian random 
variables. There is an unknown mixture matrix A and the observations are x = As. Given many samples the goal 
is to deconvolve and recover each sample s^. In our setting, each Si can be a high-dimensional vector with complex 
correlations. It is information-theoretically not possible to deconvolve and recover the individual samples s^. Instead 
we aim to learn the distribution Si ( 6 ) without having explicit samples from it. 

Another related model is canonical correlation analysis (CCA)||6l. The generative model interpretation of CCA is: 
there is a common signal z ^ N{0,1), and view-specihc signals ^ A(0, 1). Each view is then sampled 
according to N{A^'^'>z + where m index the view. CCA is equivalent to maximum likelihood 

estimation of in this generative model. In our framework, CCA corresponds to the very restricted setting where 
S'!, iS'2, S'3 are all Gaussians. RCA learns without making such parametric assumptions about S2 and S'3. Moreover, 
using CCA, it is not clear how to learn the distribution S*! if it is not orthogonal to the shared subspace 82 - In our 
experiments, we show that the naive approach of performing CCA (or kernel CCA) followed by taking the orthogonal 
projection leads to very poor performance. Factor analysis (FA)||5l also corresponds to a multivariate Gaussian model, 
and hence does not address the general problem that we solve. In FA, latent variables are sampled z ~ A(0, 1) and 
the observations are a:|z ^ A(/i + Az, T'). 

A different notion of contrastive learning was introduced in m. They focused on settings where there are two 
mixture models with overlapping mixture components. The method there applies only for Fatent Dirichlet allocation 
and Hidden Markov Models and requires explicit parametric models for each component. 


4 Using Cumulants in learning applications 


The cumulant extraction techniques of Section constructs unbiased estimators for the cumulants of Si. In this 
section we show how to use the estimated cumulants/moments to perform maximum likelihood learning of Si. For 
concreteness, we frame the discussion on the contrastive learning setting, where we want to learn ^i. For general 
RCA the method works when L (see Dehnition 3.1 1 is small or the distributions have specihc relationship between 
lower and higher cumulants. 


4.1 Method-of-Moments 

RCA recovers the cumulants of Si, from which we can construct all the moments of in time 0{{tdY). This 
makes it possible to directly combine RCA with any estimation algorithm based on the method-of-moments. Method- 
of-moments have numerous applications in machine learning. The simplest (and most commonly used) example is 
arguably principal component analysis, where we want to hnd the maximum variance directions in ^i. This is only 
related to the covariance matrix E[S'i5'7]. RCA removes the covariance due to S 2 and constructs an unbiased estimator 
of ], from which we can extract the top eigen-space. 

The next simplest model is least squares regression (FSR). Suppose the distribution S*! contains samples and labels 
{X, y) C X R, and only the samples are corrupted by perturbations, i.e. Y is independent of S 2 . FSR tries to hnd 
a parameter f] that minimizes E[(y — /3^ A)^]. The optimal solution again only depends on the moments of (A, Y): 
j3* = (E[AA^])“^E[yA]. Using the second-order cumulants/moments extracted from RCA , we can efficiently 
estimate /3*. 

Method-of-moment estimators, especially together with tensor decomposition algorithms have been successfully 
applied to learning many latent variable models, including Mixture of Gaussians (GMM), Hidden Markov Model, 
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Latent Dirichlet Allocation and many others (see mi). RCA can be used in conjunction with all these methods. We’ll 
consider learning GMM in Section]^ 

4.2 Approximating Gradients 

There are many machine learning models where it’s not clear how to apply method-of-moments. Gradient descent 
(GD) and stochastic gradient descent (SGD) are general purpose techniques for parameter estimation across many 
models. Here we show how to combine RCA with gradient descent. The key idea is that the extracted cumu- 
lants/moments of forms a polynomial basis. If the gradient of the log-likelihood can be approximated by a low- 
degree polynomial in Si, then the extracted cumulants from RCA can be used to approximate this gradient. 

Consider the general setting where we have a model T) with parameter 6 , and for any sample si the likelihood 
is C{9, Si). The maximum likelihood estimator tries to hnd the parameter that maximizes the likelihood of observed 
samples: 9* = argmaxE[log£(0, si)]. In many applications, this is solved using stochastic gradient descent, where 
we pick a random sample and move the current guess to the corresponding gradient direction: -f 

77(Ve log C{9, sf ^), where rjt is a step size and is the f-th sample. For convex functions this is known to converge 
to the optimal solution na. Even for non-convex functions this is often used as a heuristic. 

If the gradient of log-likelihood Velog£(0, si) is a low degree polynomial in si, then using the lower order 
moments we can obtain an unbiased estimator for E[Ve logC{9, S'!)] with bounded variance, which is sufficient for 
stochastic gradient to work. This is the case for linear least-squares regression, and its regularized forms using either 
£i or £2 regularizer. 

In the case when log-likelihood is not a low degree polynomial in Si, we approximate the gradient by a low 
degree polynomial, either through simple Taylor’s expansion or other polynomial approximations (e.g. Chebyshev 
polynomials, see more in IfTOll '). This will give us a biased estimator for the gradient whose bias decreases with the 
degree we use. In general, when the (negative) log-likelihood function is strongly convex we can still hope to hnd an 
approximate solution: 

Lemma 4.1. Suppose the negative log-likelihood function F{9) = — E[logE(0, S'!)] is ^.-strongly convex and H- 
smooth, given an estimator G{9) for the gradient such that ||G(0) — S/F{9)\\ < e, gradient descent using G{9) with 
step size ^ converges to a solution 9 such that ||0 — 0* |P < 

When high degree polynomials are needed to approximate the gradient, our algorithm requires number of samples 
that grows exponentially in the degree. 


Logistic Regression We give a specihc example to illustrate using RCA and low degree polynomials to simulate 
gradient descent. Consider the basic logistic regression setting, where the samples Si = {x,y) G x {0,1}, and 

the log-likelihood function is log£(0, si) = log -^—yrz- The gradient of the log-likelihood is: Vg \ogC{9, si) = 

(y- 


l-|-e» 


X. 


We can then approximate the function 


H-e' 


flT; 


using a low degree polynomial in 9^ x. As an example, we use 3rd 


degree Chebychev: 


H-e® ' 


0.5 -I- O.2450^x — O.O14(0^x)^. The gradient we take in each step is 


E[Vs logC{9, S'!)] « E[FX] - 0.5E[X] - 0.245E[X(6i^A)] -f 0.014E[X(6 |Ta:)3]. 

To estimate this approximation, we only need quadratic terms E[A(0^X)] and a projection of the 4-th order moment 
E[Ar(0^Ar)^]. These terms are computed from the projected 2nd and 4-th order cumulants of X that are extracted 
from the cumulants of U and V via Section Because of the projection these quantities are much easier to compute 
(in fact, they can be estimated in linear time). 
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Figure 1 : All the y-axis indicate mean squared error (MSB), a-e shows the tradeoff between sample size and MSB for 
the four algorithms in each of the five applications. f,g shows the convergence rate of SGD for the logistic and Ising 
models, h-j shows the tradeoff between perturbation strength and MSB. k shows the inference accuracy of A. Brror 
bars corresponds to standard deviation. 


5 Experiments 

In the experiments, we focus on the contrastive learning setting where we are given observations of [/ = S'! + 
S 2 and V — AS 2 + and the goal is to estimate the parameters for the Si distribution. Our approach can also 
learn the shared component S 2 as well as S^,- We tested our method in five settings, where corresponds to: low 
rank Gaussian (PCA), linear regression, mixture of Gaussians (GMM), logistic regression and the Ising model. The 
first three settings illustrate combining RCA with method-of-moments and the latter two settings requires RCA with 
polynomial approximation to stochastic gradient descent. In each setting, we compared the following four algorithms: 

1. The standard learning algorithm using the actual samples si ~ SiiO) to learn the parameters 9. This is the 
gold-standard, denoted as ‘true samples’. 

2. Our contrastive RCA algorithm using paired samples from U and V to learn Si (9). 

3. The naive approach that ignores S 2 and uses U to learn Si(9) directly, denoted as ‘naive’. 

4. Birst perform Canonical Correlation Analysis (CCA) on U and V, and project the samples from U onto the 
subspace orthogonal to the canonical correlation subspace. Then learn 5*1 from the projected samples of U. We 
denote this as ‘CCA’. 

In all five settings, we let S 3 be sampled uniformly from [—1,1]'^, where d is the appropriate dimension of S 3 . The 
empirical results are robust to other choices of 5*3 that we have tried, e.g. multivariate Gaussian or mixture of Gaus¬ 
sians. 
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Contrastive PCA. Si was set to have a principal component along direction vi, i.e. si ^ JV{0, vivj + a^I). S 2 was 
sampled from Unif([—1,1]"^) + V 2 vJ and vi, V 2 are random unit vectors in RCA constructs an unbiased estimator 
of ] from the samples of U and V. We then report the top eigenvector of this estimator as the estimated iti. 

We evaluate each algorithm by the mean squared error (MSE) of the inferred vi to the true vi. 


Contrastive regression. S*! is the uniform distribution, si ^ Unif([—1,1]“^) and y = /3^si + A/'(0,1). S 2 was 
sampled from Unif([—1,1]^^) + V 2 vJ and /3, V 2 are random unit vectors in Our approach gives unbiased estimator 
of ] from which we estimate (3 = (E[S'iS'7])“^E[yS'!]. All algorithms are evaluated by the MSE between the 

inferred /3 and the true /3. 


Contrastive mixture of Gaussians. Si is a mixture of d spherical Gaussians in si ^ J2t=i 

S 2 is also a mixture of spherical Gaussians, S 2 J2k=i RCA gives unbiased estimators of the third- 

order moment tensor, E[si (8) Si <8) Si]. We then use the estimator in Q to get a low rank tensor whose components 
correspond to center vectors, and apply alternating minimization (see 11) to infer Algorithms are evaluated by 
the MSE between the inferred centers ^nd the true centers 


Contrastive logistic regression. 

from Unif([—1,1]'^) + V 2 vJ, and 
mation to the SGD of logistic regression as in Section 
true /3. 


Let Si ~ Unif([—1, l]'^) and y = 1 with probability 


l+e- 


S 2 was sampled 


, V 2 are unit vecto rs in M“. We use the 4-th order Chebychev polynomial approxi- 
Evaluation is the MSE error between the inferred (3 and the 


4.2 


Contrastive Ising model. Let Si be a mean-zero Ising model on d-by-d grid with periodic boundary conditions. 
Each of the d3 vertices are connected to four neighbors and can take on values {±1}. The edge between ver¬ 
tices i and j is associated with a coupling Jij ~ Unif[—1,1]. The state of the Ising model, si, has probability 
where Z is the partition function. We let S 2 also be a d-by-d grid of spins where half of the 
spins are independent Bernoulli random variables and the other half are correlated, i.e. they are all 1 or all -1 with 
probability 0.5. We use composite likelihood to estimate the couplings Jy of Si, which is asymptotically consistent 
with MLE of the true likelihood lfT3l . Eor the gold-standard baseline (which uses the true samples si), we use the 
exact gradient of the composite likelihood. Eor RCA , we used the 4-th order Taylor approximation to the gradient. 
Evaluation is the MSE between the true Jij and the estimated J^ . 


Results. Eor the method-of-moment applications-PCA, linear regression, GMM-we used 10 dimensional samples 
for U and V. The tradeoff between inference accuracy (measured in MSE) and sample size is shown in the top row 
of Eigure[T] Even with just 100 samples, RCA performs significantly better than the naive approach and CCA. With 
1000 samples, the accuracy of RCA approaches that of the algorithm using the true samples from Si. It is interesting 
to note that projecting onto the subspace orthogonal to CCA can perform much worse than even the naive algorithm. 
In the linear regression setting, for example, when the signal of S 2 happens to align with (3, the direction of prediction, 
projecting onto the subspace orthogonal to S 2 loses much of the predictive signal. 

In the SGD settings, we used a 10 dimensional logistic model and a 5-by-5 Ising model (50 Jij parameters to 
infer). RCA also performed substantially better than the two benchmarks (Eigure[2d, e). In all the cases, the ac¬ 
curacy of RCA improved monotonically with increasing sample size. This was not the case for the Naive and CCA 
algorithms, which were unable to take advantage of larger data due to model-misspecification. In Eigure[T]f and g, 
we plot the learning trajectory of RCA over the SGD steps for representative runs of the algorithm with 1000 sam¬ 
ples. RCA converges to the final state at a rate similar to the true-sample case. The residual error of RCA is due to 
the bias introduced by approximating the sigmoid with low-degree polynomial. When many samples are available, a 
higher-degree polynomial approximation can be used to reduce this bias. 

We also explored how the algorithms perform as the magnitude of the signal in S 2 is increased compared to Si 
(Eigure[T]h-j) with fixed 1000 samples. In these plots the cc-axis measures the ratio of standard deviations of S 2 and 
S'! .At close to 0, most of the signal of U comes from 5'i, and all the algorithms are fairly accurate. As the strength 
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of the perturbation increases, RCA performs significantly better than the benchmarks, especially in the Ising model. 
Finally we empirically explored the sample complexity of the subroutine to recover the A matrix from the 4th order 
cumulants. Figurej^k shows the MSE between the true A (sampled ^ Unif[—1, and the inferred A as a function 

of the sample size. Even with 1000 samples, we can obtain reasonable estimates of A € 

Biomarkers experiment. We applied RCA to a real dataset of DNA methylation biomarkers. Twenty biomarkers 
(10 test and 10 control) measured the DNA methylation level (a real number between 0 and 1) at twenty genomic loci 
across 686 individuals ca. Each individual was associated with a binary disease status Y. Logistic regression on the 
ten test biomarkers was used to determine the weight vector, /?, which quantihes the contribution of the methylation 
at each of these ten locus to the disease risk. The other ten independent loci are control markers. Getting accurate 
estimates for the values of (3 is important for understanding the biological roles of these loci. In this dataset, all the 
samples were measured on one platform, leading to relatively accurate estimate of /3. In many cases samples are 
collected from multiple facilities (or by different labs). We simulated this within our RCA framework. We let be 
the original data matrix of the ten test markers across the 686 samples. We let be the original data matrix of the ten 
control markers in these same samples. We modeled S 2 as a mixture model, where samples are randomly assigned to 
different components that capture lab specific biases. The perturbed observations 3xeU = S 1 +S 2 and V = A5'2 + 53, 
i.e. U and V simulate the measurements for the test and control markers, respectively, when the true signal has been 
perturbed by this mixtures distribution of lab biases. We assume that we can only access U and V and do not know 
S 2 , i.e. where each sample is generated. Running logistic regression directly on U and the phenotype + obtained a 
MSE of 0.24 (std 0.03) between the inferred /? and the true f3 measured from directly regressing Si on +. Directly 
using CCA also introduce signihcant errors with MSE of 0.25 (std 0.02). Using all the control markers as covariates 
in the logistic regression, the MSE of the test markers’ f3 was 0.14 (std 0.03). In general, adding V as covariates to the 
regression can eliminate S 2 at the expense of adding S 3 , and can reduce accuracy when S 3 is larger than S 2 - Using 
our RCA logistic regression on U and V, we obtained signihcantly more accurate estimates of 6 , with MSE 0.1 (std 
0.03). See Appendix for more analysis of this experiment. 
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A More Tensor and Cumulant Notations 


In this section we introduce the notations and basics for tensors and cumulants. 

Matrix Notations For a matrix M G we use ||M|j to denote its spectral norm sup|| 3,||_2 ll-^ 2 ;||, ||M||i^ to 

denote its Frobenius norm ||M||i;’ = o'miniM) to denote its smallest singular value. 

When n > m and the matrix M has full column rank, we use to denote its Moore-Penrose pseudoinverse 
which in particular satisfy M = I. 

We also sometimes use the Kronecker product of matrices, for A G and B € B is a matrix in 

^mpxnq following block structure: 

AiiB A12B ■ ■ ■ Ai^nB 

A 2 ,iB A 2 . 2 B ■ ■ ■ A 2 ^nB 

A® B = 

Am,lB Ajyi 2B ' ' ' AjYi^nB 

The singular values of ^ 0 i? is just the product of singular values of A and B. 

Tensor Notations A tensor T G is a f-dimensional array, and is frequently used to represent higher order 
moments or cumulants. We index the elements in the tensor using a f-tuple (ii, Z 2 , it) G [<^]*- The entries of tensor 
product [ui 0 M 2 0 • • • is simply the product of corresponding entries nj=i We use tt®* to denote 

u® u ® ■ ■ ■ ® ut times. 

For a distribution X G the f-th order moment is a tensor E[Ar®^], whose (zi,Z 2 , ...,Zt)-th entry is equal to 
E[Xi^Xi 2 ■ ■ ■ -AiJ. Later we shall see cumulants can also be conveniently represented as tensors. 

A tensor can be viewed as a multi-linear form (just as a matrix M can be viewed as a bilinear form vFMv). For a 
tensor T we define T{Mi, M 2 , ■ ■ ■, Mt) to be 

t 

i=i 

This multi-linear form works well with the moment tensors, especially for matrices Mi , Mt we always have 

E[X®‘](Mi,..., Mt) = E[{MjX) ® {MjX)®---® {MJ AT)]. 

Often to simplify operations tensors are unfolded to become matrices. There can be many ways to unfold a tensor, 
but in this paper we mostly use a particular unfolding which makes the tensor into a matrix: 

Similar to matrices, we also define the Frobenius norm of tensors to be the £2 norm of all its entries, in particular 

\\T\\F = \\unfoldiT)\\F = 


Cumulants Cumulants provide an alternative way to describe the lower order correlations of a random variable. 
Unlike moments, cumulants have the nice property that the cumulant of sum of independent random variables equals 
to the sum of cumulants. Formally, for a random variable AT G M the cumulant is defined to be the coefficients of the 
cumulant generating function logE[e‘^] (Kt{X) is just tl times the coefficient in front of X‘). When the variables are 
different the cross-cumulants (similar to covariance) can similarly be defined, and it can be computed as: 

K*(Xi,...,Xt)=5](|7r|-l)!(-l)W-i ^7) 

-TT BGtt 
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In this formula, tt is enumerated over all partitions of [f], |7r| is the number of parts in partition and B runs through 
the list of all parts. 

Similarly, it is possible to define cumulants for multivariate distributions. For random variable X S 

,..., Xi^ ). This cross cumulant can be computed in a similar way as Equation (j 7 |, however the products should 
be replaced by tensor products and the ordering of coordinates is important when doing the tensor product. 

Fact Suppose Xi, ..., Xt are random variables in The f-th order cumulant Kt{Xi, ..., Xj) is a tensor in that 
have the following properties: 

1. (Independence) If (Xi,..., X^) and (Yi,..., Ft) are independent, then Kt{Xi-\-Yi ,..., Xf+Yf) = Kt{Xi, ...,Xt) + 
KtiY,,...,Yt). 

2. (Linearity) Kt{M^X^,MjXt) = K:t(Xi,..., X*)(Mi,..., M*). 

3. (Relation to Moments) The f-th order cumulant is a polynomial over the first f-th order moments. Similarly the t- 
th order moment is a polynomial over the hrst f-th order cumulants. Further both polynomials can be computed 
in 0{t\) time. Converting between hrst f-th order moments and cumulants for d-dimensional variables takes 
0 {{tdy) time. 

Intuitively, cumulants can measure how correlated two distributions are. The simplest case is K 2 (X, Y) which is 
equal to the covariance E[(X — E[X])(y — E[Y])], and is 0 only if the two variables are not correlated in second order. 
For more detailed introductions to cumulants see books like IS) . 


B Details for Section |3] 

In this section, we prove the equations and algorithms in Sectionj^indeed compute the desirable quantity, and further 
we give sample complexity bounds. 

B.l Contrastive Learning 

We hrst prove Equation Q computes the correct linear transformation. 

Lemma B.l (Lemma |3.1| restated). Suppose the unfolding of the 4-th order cumulant unfold{Ki{AS 2 , S 2 , S 2 , S 2 )) 
has full rank, given the exact cumulants K 4 {V,U,U,U) and K 4 (V,U,U,V), Equation ([^ the correct linear 
transformation in time 0 {df). 

Proof Since U = Si + S 2 and V = AS 2 + S 3 , we know 


K4(V, U, U, U) = K4{AS2 + S3, Si + S2,Si + 52, Si + S2) 

= K 4 { 0 , Si, Si, Si) + K 4 { AS 2 , S2, S2, S2) + ^4(53,0,0,0) 

= K4iAS2,S2,S2,S2). 

Here the second step uses the fact that cumulants are additive for independent variables, and third step uses the linearity 
of cumulants. 

Similarly, we know K4{V, U, U, V) = k,4{AS2, S2, S2,AS2) = k,4{AS2, S2, S2,S2){I,I,I,A^). 

For the unfoldings of these cumulants, we have 

unfold{cum4{V, U, U, V)) = unf old{cum4{V, U, U, U))A'‘ . 

Therefore when unfold{cum4{V, U, U, V)) has full rank we can compute A using pseudo-inverse. 

For the running time, the main computation is a pseudo-inverse and a matrix product for d^ x d matrices, both take 
0(d®) time. □ 
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Next we show given the linear transformation, it is possible to estimate the cumulants using Equations In 

fact, we can also avoid computing the cross-cumulants and work with just the cumulants of variables; 


Kt(S'i) 

,,,, Kt{U + A-^V) - Kt{U) - Kt{A-^V) 

- ^tiU) 2*-2 

(8) 

Rt{S2) 

Kt{U + A-^V) - Rt{U) - KtiA-^V) 

2* - 2 

(9) 

Rt{S3) 

'^t{AU -\-V) — Kt(AU) — Kt{y) 

- 2‘-2 

(10) 


'Ehcoi'cni (Theorem^^^^restated), Foy all f 1, Ec^uatiovis 0-0 or 0-@ compute the correct cumulants for 
Si, S2, S3 in time Moreover, ifV has dimension higher than U and A has full column rank, replacing 

A~^ by A'^ still gives correct cumulants. 

Proof The proof of Equations 0-0 is very similar to the previous lemma. Note that 

Kt{U,U, A ^V) = Kt{Si + S2, ■■■, Si + S2, S2 + A ^Ss) 

= KtiSi,...,Si, 0 ) + KtiS2,S2,S2,S2) + Kt(0, 

= Kt{S2). 


So we have Equation 0, and using the fact that Kt(U) = Kt{Si) + Kt{S 2 ) we get Equation 0. Equation 0 follows 
similarly. 

In order to get Equations 0-([lO|, first note that by the linearity of cumulants, we can write Kt{U + A~^V) as the 
sum of 2* terms; 

Kt{U + A ^V) = Kt{ziU + {1 — zi)A ^V, Z 2 U + [1 — Z 2 )A ^V, ztU + {1 — zt)A ^V). 

2G{0,1}* 

Among all these terms, one is equal to Kt{U), one is equal to Kt{A~^V), and all the other 2* — 2 terms are cross- 
cumulants that involve both U and V. Since S 2 is the only variable that appears in both U and A~^V, all the 
2* — 2 terms are equal to Kt{S 2 ), therefore we have Equation 0. Equation 0 again follows from the fact that 
Kt{U) = Kt{Si) + Kt{S 2 ), and Equation ( [Io| ) is very similar. 

The moreover part follows by directly replacing A~^ with A^ in the above argument. Note that in this case we can 
still find A because unf old{currn{AS 2 , 5*2,5'2, S 2 )) still has full rank as long as unf old{cumi{S 2 , <S' 2 , S' 2 , «S' 2 )) has 
full rank. 

Eor running time, the main bottleneck is computing the cumulants (which takes 0{{tdY) time), and then applying 
the matrix A to the cumulants (which takes 0{dd‘^^) time). □ 

Einally, we show the equations are robust under sampling noise. Eor that we use the following bounds on cumulants 

Fact (O) Eor any cross-cumulant Kt{Ui, ..., (7t), if all the variables have bounded norm \\Ui\\ < R, then the cumulant 
has Erobenius norm bounded by (tR)^. 

In practice we use fc-statistics ca to estimate the cumulants, the standard deviation of fc-statistics is bounded by a 
similar formula. 


Lemma B.3. Suppose the distributions Si,S 2 , S 3 have bounded radius R, the A-th order cumulant unfold^K^ (V, U,U,U)) 
has smallest singular value a 3 , matrix A has smallest singular value a a ond || A|| > 1, given A-th order cumulants that 
are e-close in Frobenius norm (and e <C R'^), the linear transformation A is recovered with accuracy e|| Apii^/crl. 
Given t-th order cross-cumulants of U,V that are et-close in Frobenius norm, the cumulants of Si can be computed 


with accuracy O 


ll^li 


- 


— ) using {3 I In particular, to estimate the cumulants of Si with accuracy rj the 


number of samples required is 
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Proof. First we show the algorithms are robust under perturbation. For that we need the fact that any 4-th order cross- 
cumulant with bounded variables always have Frobenius norm of order at most 0{R*). As a corollary we know the 
cross-cumulant K 4 ^{V, U, U, U) has norm at most 0(|j A|ji?'^) and k, 4 ^{V, C/, U, V) has norm at most 0(|| Let M 

be the noisy version of M = unfold{Ki{V, U, U, U)), by assumption and by standard matrix perturbation bounds, we 
know ||M'f — M'^Wp < 0(e/(T|). On the other hand, let N be the noisy version of iV = unfold{K 4 (y, U, C/, V)), we 
know jjiV — 7 V||f < e, therefore 

\\M^N - M^N\\p < 0{\\M^ - Mt||||Ar||^ + ||M||||7V - N\\f) < OieWAW^R^al). 

For computing the f-th order cumulant, the main source of error is applying A~^ to the cross cumulant Kt{U, ...,U,V) 
to get Kt{U,U, A~^V), as we don’t have the matrix A exactly. Since the norm cross-cumulant is always bounded 
by (fi?)‘|| A|j, we know when e is small enough the error is roughly (ignoring lower order terms) 

||i-i - A-i|| \\Kt{U ,.... U, 1/)||f + II A-i|| \\KtiU ,..., U, V) - KtiU ,..., U, f^)||F 


which is bounded by 


O 


eWAfRHtRY ^ et ^ 


afa 






Also, by the variance bounds for cumulants we know with Z samples, et < (fi?)‘|| A||/a/Z ande < 0{RY\A\Y / y/Z), 
therefore when Z = fd{{tR)Y*'\\A\f'^R}^/a\a\Tf'), the estimation of has desirable error. □ 


B.2 Rich component analysis 

We first give the algorithm for computing the linear transformations and then show it computes the correct quantities. 


Algorithm 2 FindLinear 

Require: set system {Qj} that is L-distinguishable, L + 1-th order moments 

repeat 

Pick a set Qj that is not a subset of any remaining sets 
Let T = {wi,W 2 , ...jWl} be the distinguishing set for Qj 
Compute cumulants for all i G Qj-. 

Ali — UTlf oldi^Kpj-i , ..., ^ Uf) 

l-QjCQi 


If = 0 (or aminiMminQj) IS too Small), then Sj = 0; continue the loop. 

Let for all i G Qj, = 0 for all i Q^. 

Mark Qj as processed, and let 

l-Qj(AQi 

until all sets are processed 


The main idea behind this algorithm is that Since we know the sets are L-distinguishable, if we start from maximal 
set Qj, there must be a distinguishing subset of size L that is only contained in Qj. Similar to the contrastive setting, 
if we consider a cross-cumulant that contains all the variables in this distinguishing set, then the resulting cumulant 
must only depend on this particular variable Sj . Further, using different last variable in the cross-cumulants (similar 
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to using K 4 ,{V, U, U, U) and Ki{V, U, U, V)) and exploit the linearity of the cumulants, we can recover the linear 
transformations. 

Note that without loss of generality we can assume j) = because otherwise we can replace Sj with 

the distribution S'' = 

Lemma B.4 (Lemma |3.3| restated). Given observations Ui’s as defined in Equation^ suppose the sets Qj’s are L- 
distinguishable, all the unknown linear transformations A^''’^'i’s are invertible, unfoldings unfold{cumL^i{Sj)) is 
either 0 (when Sj = Oj or have full rank, then given the exact L + 1-th order cumulants, Algorithm^outputs the 
correct linear transformations A^'^d) iyi poly{L\, 

Proof. We prove this by induction using the following hypothesis; 

For all the processed variables, Algorithm[^finds the correct linear transformations ^nd cumulant (Sj). 
This hypothesis is clearly true at the beginning of the algorithm (as no variables are processed). We now show the 
algorithm will compute the correct quantities for the next variable Sj. 

By the algorithm, we know the set Qj is not a subset of any remaining sets, and T is a distinguishing set. Therefore, 
for other remaining set, we know it cannot contain all the elements in T. Therefore, by linearity and additivity of 
cumulants, we know kl+i(C/u,i ,..., U^t, Ui)(i G Qj) will only depend on the variable Sj and some of the previously 
processed variables. In particular, 

= « l+i (A^^^ '^'>Sj , A^^^ S ,) 

l-QjCQi 

By induction hypothesis, we have already processed all the other terms related to Qi (I j and Qj C Qi), so we 
have the correct cumulants kl+i{Si) and linear transformations Those terms will be subtracted out during 

the algorithm. Therefore we know 

M, = unfold{KL+i{A^^'^’^'^S j, Sj, ...,A<‘^*-^^Sj,Ad’3^Sj)) 

= unfold{KL+i{A^^^’^'^S j, Sj, ..., Sj, Sj)){A’~^’^'>)^ 

In particular MminQj = unfold{KL+i{A^'^^d)g^^^(w 2 j)g.^ Sj)). Therefore, the algorithm computes 

the correct linear transformations if the matrix Mmin Qj has full rank. 

The fact that M^in Qj has full rank is implied by assumptions, because we can write this matrix as 

unfold(KL+i{A^'^^’^^Sj,A^'^-^’^">Sj, ...,A^'^^’i'^Sj,Sj)) = {A^^^’^'> 0 • • • 0 A<~^^'^y)unfold(KL+i{Sj)). 

Here (g) is the Kronecker product of matrices, and it is well-known that the Kronecker product of invertible matrices are 
still invertible. Since unfold{KL+i (Sj)) is either 0 or has full rank by assumption, we know we can either detect there 
is no component corresponding to set Qj, or have a matrix with full rank. In the latter case the correctness 

of the L -f 1-th order cumulant calculation then simply follows from the linearity of cumulants. 

Finally, we estimate the running time of the algorithm. Computing any cumulant can be done in poly(L!, d^) time. 
Finding the distinguishing set (by exhaustive search) takes no more than poly(fc^) time. The algorithm runs in at most 
p < 2^ iterations, each iteration computes a small number of cumulants and does small number of linear-algebraic 
calculations (which are all poly in (kd)^), so the total running time is at most poly(L!, (kd)^) □ 

Now we are ready to give the algorithm for computing cumulants and prove that it works. 

Theorem B.5 (Theorem |3.4| restated). Under the same assumption as Lemma [373] for any t > L Algorithm^computes 
the correct t-th order cumulants for all the variables in time poly({L 1)\, {dk)^d-'t)_ 

Proof. The proof of this lemma is very similar to the previous one. Again we prove the lemma by induction, under 
the following induction hypothesis; 

For all the processed variables, Algorithm|^finds the correct cumulant Kt{Sj). 
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Algorithm 3 ComputeCumulant 

Require: set system {Qj} that is L-distinguishable, order t>L 
Ensure: t-th order cumulant for all the variables 

Apply Algorithm|^to find remove all sets whose variables do not appear. 

repeat 

Pick a set Qj that is not a subset of any remaining sets 

Let T = {wi,W 2 , ■■■, wl} be the distinguishing set for Qj, let wl+i = wl +2 = ■ ■ ■ = Wt = wl- 
Mark Qj as processed, let 

(5,) = K, ((1,..., (1J 

l-QjCQi 


until all sets are processed 


This is clearly true before the main loop. We now show that the algorithm successfully compute the cumulant of 
the next variable. 

Similar as before, since wi, ...,Wt contains all the elements of a distinguishing set T, we know 


= Kt{Sj) + ..., 

l-.QjCQi 

i-QjCQi 

By induction hypothesis all the other terms are computed in previous iterations of the algorithm, so they are 
subtracted out. Therefore we get the first term which is equal to Kt{Sj). □ 

Finally we prove the sample complexity bounds. 

Lemma B.6. Suppose the distributions Sj’s have bounded radius R, the L + \-th order cumulant unfold{KLj.i (Sj)) 
has smallest singular value (t^, nonzero matrices A^'^’A hQg smallest singular value a a ond spectral norm at most || A||. 

Also, suppose the longest chain of subsets Qj-^ C Qj 2 C • • • C has length q. Given L + \-th order cumulants 
that are e-close in Frobenius norm, the linear transformation A is recovered with accuracy e{pLR\\A\\ /a a^^ 

Given t-th order cumulants that are e-close in Frobenius norm, the cumulants of Sj can be computed with accuracy 
e(pLi?|| All In particular, to estimate the cumulants of Si with accuracy n the number of samples 

required is Q.{{pLR\\A\\/a a(t)■ 

Proof We prove this by induction. For each variable Sj, let depth qj be the length of the longest chain such that Qj C 

Qji C • • • C Qqj-i- We shallprove that the A^^’-^^’s are recovered with accuracy = 0(e(p|| A|p^+^(Li?)‘+^/cr^^cr^)'Jj“^|| A||‘+^( 

and the L + 1-th cumulant is recovered with accuracy = 0{LeA,qj |1 AfminQj IIf/c^)- 

First we show the base case, when qj = 1 and therefore there is no other set that contains this set. In this case, A^'^’A 
is just equal to where the M’s are the unfoldings of L + 1-th order cross-cumulants (so we have them 

with accuracy e). By standard matrix perturbation bounds we know the error is bounded by e\\Mi\\p/aminiMnAnQj 
and we just need to bound the smallest singular value and Frobenius norm for the M’s. For M^ninQj, we know it 
is equal to a linear transformation of the unfolding of KLpi{Sj), therefore crminiMminQ ) > cta^c- Similarly we 
have ||Mi||i? < 0(||A||'^+^(Li?)^+^). Therefore ca.i = 0(e||A||^+^(Li?)^+^/(T^^cr^). When we compute the 
L 1-th order cumulant, the dominating term is applying the inverses of the A matrices we estimated, and we know 
= 0(Le^4||M„iinQj|FM) = 0{eL\\Af^+^{LR)'^^+^/a\^+^al). 
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Suppose we have shown this for all of Qj’s with small depth qj < u. For a set Qj with qj = u 1, when we 
compute the matrices M we need to subtract the cumulants of the previously computed variables. The number of such 
variables is at mostp, and each variable has an additional error of + LeA,qj-i\\A\\^{LR)^^^) < 

0{en^q^-i\\A\\^^^). This is our new error in estimating the cumulants. Therefore, by the same 

argument we have 5 . = 0(e„,q^-_ip|| = 0 {e{p\\A\\^^+'^{LK)^+^/a'j^alYi~^\\A\\^+'^{LR)^+'^ 

The rest of the proof follows from very similar induction on Algorithm]^ □ 


C Details for Section m 


In this section we prove Lemma 4.1 which shows for a strongly convex function, given a biased estimator for the 
gradient we can still hope to get close to its optimal solution. 


Lemma C.l (Lemma 4.1 restated). Suppose the negative log-likelihood function F{6) = —E[log'H(0, ^i)] is p- 
strongly convex and H-smooth, given an estimator G{6) for the gradient such that ||G(0) — Vi^(0)|j < e, gradient 
descent using G{9) with step size ^ converges to a solution 9 such that ||0 — 0*|p < 

Before proving this lemma we first introduce basic definitions for strongly convex functions. 


Definition C.l (/i-strongly convex). A function F(9) (whose second order derivatives exist) is p-strongly convex if 
for any two points 9,t we have 


F{9) > F{t) + {S7F(t),9 - r) + ||10 - rf. 


Definition C.2 (iL-smooth). A function F{9) (whose second order derivatives exist) is FI-smooth if for any two points 
9, T we have 

F{9) < F(t) + {VF{t),9- t) + file - rf. 

The proof of Lemma [4T| mostly follows from the approximate gradient framework in ||2|. For completeness we 
also give the proof here. 


Proof. Let 9* be the optimal point. First, by /x-strongly convexity we know 

{VF{9),9 - 9*) > F{9) - F{9*) + - e*f. 

On the other hand, by iL-smoothness we know 

F{9*) < minF(6» - ryVF(0)) < minF(a;) - r]\\\7F{9)\\^ + ^||VF(6l)f = F{x) - ;^||VF(6»)f. 
V V 2 2H 


Therefore 

{S7F{9),9 - 9*) > ^\\S7Fi9)r + f^\\9 - 94^. (11) 

Now we prove even when the gradient G{9) is only an approximation, the above equation still holds approximately. 

Claim C.l. 

{G(9),9- 9*) > ^\\G(9)4 + ^\\9- 0*f - 24/^. 

Proof We know 

{G(9), 9-9*) = {\7F{9),9 - 94 + {G(9) - VF(6»), 9 - 94 

> {S/F{9),9-94 -e\\9-94 

> {VF(9),9-94-^\\9-94^--. 

4 p 
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Also, ||G'(0)|p < 2\\G[0) — VF(0)|p + 2||VF(0)|p. Using these two inequalities in Equation (111, we get 


{G{6),e-B*) > (VU(0),0-0*)- 

> ^l|VF(0)f+ ^|10-0,f _ 



>^l|VF(0)f+ ^||0-0*f 


M 


□ 


The above Claim essentially matches the (a, /3, e)-approximate condition in ||2l. Now suppose the update rule is 

Q(t+l) ^ Q(t) _ _]_Q(^Q(t)y 


We can then prove convergence result: 

Claim C.2. 

ll0W_rf <(i-^)‘ll0(o)-0,f+ ^. 

4// /j,^ 

Proof. We prove this by induction. Assume this is true for step t (the base case f = 0 is trivial), then for the next step 
we have 


|g|(t+l) _ 0*||2 ^ pit) _ Q*p _ _ 0*) + 1 |j(^(5,(t))||2 


2H 


2H' 


<ll0W-rf-^(|l|0W-rf-^) 


< (1- -rf + —. 

- ^ ah’" " fiH 


Substituting in the bound for — 9*\f we get the exact claim. 


□ 


Therefore by carefully choosing the step size gradient descent quickly converges to a nearby point (in fact similar 
argument works as long as the learning rate is upper bounded by ^). Similar arguments can be proved for stochastic 
gradient with a small enough step size (depending on the variance). □ 


D Details for Section HI 

Ising model inference. Let 9 = { Jy }, the composite log-likelihood Id of this Ising model can be written as 

PiS\9) 


ld(.9) = Eg 


H log 

ieVert 


P{S, S{i) = l\9) + P{S, S{i) = -1\9) 
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The gradient of the composite log-likelihood with respect to a particular Jij is 


_ ‘2S{i)S{j) _^_ 2S{i)S{j) _ 

1 -I- exp(2S'(i) X]/ceneigh(i) JikS{k)) 1 + exp(2S'(j) X]feeneigh(i)jjfcS(/c)) 

2S{{)S{j) - Sii)^S{j) Y. '^-kS{k) - Sii)Sijf Y J^kSik) 
feeneigh(i) feeneigh(i) 

where we have used the 4-th order Taylor expansion. We then use the 2nd and 4th order cumulant tensors from U and 
V to obtain unbiased estimator of terms E 5 [S'(i)S'(j)] and Es[S{i)'^S{j)S{k)]. This gives the approximate gradient 
used in SGD. In the experiments, we used batch size of 100 samples to approximate each step of the gradient. 

Biomarkers experiment. In the simulation for having samples from multiple facilities, we get two views of the 
data, where U = 81+82 and V = 82 + 83 . Here 81 represents the values for test markers, 83 represents the values 
for control markers, and ( 5 ' 2 , 82 ) jointly represents the perturbation caused by different labs. 

In our set up, we assume the samples come from two different labs, each lab has a bias on all the 20 markers (we 
use (p^, q^) G x to denote the biases). That is 

(Q c't — / sample from lab 1 

( 2 , 2 ) — sample from lab 2 

In this case, when the vectors p, q are in general positions, it is easy to see that there is a rank-2 matrix A such that 
82 = A 82 . In particular, let P G be the matrix whose columns are p^,p^, Q G be the matrix whose 

columns are q^, q^, then we know A = 

Note that this does not fit directly in our framework as the distribution 82 has low rank, and therefore the 4-th order 
cumulant cannot have full column rank. However, we can consider 82 = PX 2 and 82 = QX 2 (where X 2 = (1,0) 
for samples from lab 1 and X 2 = (0,1) for samples from lab 2). We show that the algorithm still makes sense in this 
setting. 

Let W = unfold{K 4 {QX 2 , PX 2 , PX 2 ,X 2 )) G Kiooox 2 ^ linearity of cumulants, we know 

Ml = unfold{K 4 {V, U, U, U)) = WP^, 

M 2 = unfold{K 4 {V, U, U, V)) = WQ^. 

In this case. Mi does not have full column rank, so the usual definition of pseudo-inverse does not work. However, 
we can write P = ZR where Z G is an orthonormal matrix {Z^Z = /), and R G and then hope to find 

m\ such that M\Mi = ZZ"' (this is possible because we can let m\ := Z{WRA)"^). 

When we use this definition of pseudo-inverse, it is easy to check that {m\M 2 )^ = QR~^Z^ = QP\ therefore 
our algorithm can still recover the correct rank-2 A matrix. 


Vj.j Id = Es 
~ Eg 


19 



